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Abstract 

The development of accurate and reliable dynamical modeling procedures that describe the time evolution of gene 
expression levels is a prerequisite to understanding and controlling the transcription process. We focused on data from DNA 
microarray time series for 20 Drosophila genes involved in muscle development during the embryonic stage. Genes with 
similar expression profiles were clustered on the basis of a translation-invariant and scale-invariant distance measure. The 
time evolution of these clusters was modeled using coupled differential equations. Three model structures involving a 
transcription term and a degradation term were tested. The parameters were identified in successive steps: network 
construction, parameter optimization, and parameter reduction. The solutions were evaluated on the basis of the data 
reproduction and the number of parameters, as well as on two biology-based requirements: the robustness with respect to 
parameter variations and the values of the expression levels not being unrealistically large upon extrapolation in time. 
Various solutions were obtained that satisfied all our evaluation criteria. The regulatory networks inferred from these 
solutions were compared with experimental data. The best solution has half of the experimental connections, which 
compares favorably with previous approaches. Biasing the network toward the experimental connections led to the 
identification of a model that is only slightly less good on the basis of the evaluation criteria. The non-uniqueness of the 
solutions and the variable agreement with experimental connections were discussed in the context of the different 
hypotheses underlying this type of approach. 
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Introduction 

Dynamical modeling of transcriptional regulation networks is an 
important goal of systems biology. It holds promise to understand 
the functioning of these networks as well as their malfunctioning, 
which can aid rational modification of some targeted properties. 
This goal is expected to be withm reach due to the impressive 
amount of data generated during the last few years by powerful 
high-throughput technologies, such as DNA microarrays that 
provide the simultaneous expression levels of many or even all 
genes in a cell sample [1,2]. Moreover, time series of DNA 
microarray data yield information about the evolution of gene 
expression levels during, for example, the developmental stages of 
the host organism, the response to external perturbations, or the 
cell cycle. If these time-dependent data were accurate and 
numerous enough, they would, in principle, allow the reverse- 
engineering of the transcriptional regulation network (see e.g. [3- 
13]). However, the mathematical model structure to be used for 
that purpose is unknown. Additional issues are the non-uniqueness 
of the parameters of the model (see e.g. [14]), the usually high level 
of intrinsic noise of the microarray data, and the impurity of the 
samples that often contain mixtures of cell types. A possibility to 
handle the degeneracy of the solutions is to include biology-based 
constraints in the modeling procedure [13]. One constraint is the 
robustness of the solutions with respect to parameter variations 
(see e.g. [13,15-17]). It manages stochasticity and ensures that the 



overall behavior of biological systems does not vary with changes 
in the environment, except when large and specific perturbations 
come into play that lead the system to another state. The second 
biological constraint consists of requiring that the solutions be 
stable when extrapolated in time. It is indeed reasonable to assume 
that although the expression levels may drastically change in time, 
up to a few orders of magnitude, they do not become unreasonably 
large. 

Another issue is the extremely large size of the transcription 
regulation network, where basically all genes of an organism are, 
directly or indirectly, connected. Even when large DNA micro- 
array time series are available, the data are insufficient to identify 
all parameters of the model structures. Moreover, oftentimes 
different genes exhibit similar expression profiles, either because 
they are coregulated or because the noise level does not allow 
distinguishing them. To solve both of these problems, genes are 
often clustered into groups, and the modeling procedure is applied 
on these rather than the individual genes [9,11,13,18]. The 
disadvantage of this approach is the lack of straightforward 
physical interpretation of the resulting gene cluster networks. 
Another approach is to consider the fuU transcription network of 
an organism as separate subnetworks that are loosely connected 
and can be modeled separately to a good approximation [19-22]. 

We consider in this paper Drosophila melanogaster as our model 
organism, and focus on the subset of 20 genes that are involved in 
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Table 1. Effect of the preprocessing procedure 


and clustering algorithm on the qua 


llty of the clusters. 




Preprocessing 


Classification 


Intraclass 


intraclass 


Interclass Interclass 


procedure 


method 


<D> 


<D,,,> 


<D> 


<D,,„> 


Filtering 


tree-like 


0.43 


0.39 


1.05 


1.06 




k-means 


0.41 


0.31 


1.01 


1.08 


Smoothing 


tree-like 


0.33 


0.29 


1.00 


1.00 




k-means 


0.29 


0.23 


0.94 


1.00 


Filtering & 


tree-like 


0.31 


0.29 


0.94 


0.99 


Smootliing 


{(-means 


0.28 


0.23 


0.95 


1.03 


The number of classes is set to 1 0. The optimal procedure is indicated in bold. Intraclass <D>: 
average distance between members of the classes and their representative member; Interclass 
Interclass average distance between representative members of different classes. 


average distance between members of the classes; Intraclass <Dn.p>: 
<D>: average distance between members of different classes; 
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muscle development during the embryonic stage. This subset has 
the advantage of being well described and of having available 
experimental data about the transcriptional interactions. To tackle 
modeling, we use a combination of the approaches described in 
the previous paragraph: we disregard the connections with genes 
outside this network, and cluster the genes that have similar 
expression profiles into classes. We then proceed to model the 
dynamical behavior of gene cluster expression, using coupled 
differential equation. To reduce the number of solutions and select 
tliose that have a biological meaning, we impose the robustness 
and stability constraints described above. The resulting transcrip- 
tional networks are compared to the experimental information 
about the transcription factor-gene interactions. 

Methods 

Experimental data on Drosophila genes involved in 
muscle development 

A total of 20 genes were identified as being involved in Drosophila 
muscle development [3]. These are: CG10293 (how), CG1429 
{mef2), CG17927 (mhc), CG18251 (msp-300), CG1915 (sis), 
CG2096 (flw), CG2328 (eve), CG2956 (twi), CG3992 (srp), 
CG4376 (actn), CG4677 (Imd/gfl), CG4889 (wg), CG5596 (mlcl), 
CG5939 (prm), CG7107 (up), CG7438 (myoSlDF), CG7445 (fln), 
CG7895 (tin), CG9155 (myo61F), CG9885 (dpp). 

The time-dependent expression profiles of these 20 genes during 
the embryonic development, relative to their expression in a 
reference sample containing a standard mixture of cells at all 
developmental stages, have been experimentally characterized by 
DNA microarray techniques [23]; they have been deposited in 
NCBI's Gene Expression Omnibus [24] and are accessible 
through GEO Series accession number GSE4347 (http://www. 
ncbi.nlm.nih.gov/ geo/ query/ acc.cgi?acc = GSE4347). The DNA 
microarray technique [1] proceeds by extracting the mRNA from 
the cell sample of interest and from the reference sample, reverse 
transcribing them into cDNA, labeling them by two types of 
fluorophores, and letting them hybridize to their complementary 
sequences attached to a microarray. The fluorescence intensities 
emitted by the fluorophores from the sample of interest are 
measured relative to the intensities emitted by the fluorophores 
from the reference sample; the index fi labels the mRNA 
molecules (or equivalently, the corresponding genes or proteins). 
Here we have 1,...,20. These intensities must be normalized to 
correct for different effects including the unequal quantities of 
RNA copies, cUfferences in labeling or detection efficiencies 



between the fluorescent dyes, and systematic biases in the 
measured expression levels [25,26]. The gene expression levels 
X^, are given as the ratio of the normalized intensities and 7^, 
under the commonly made assumption that the RNA concentra- 
tions and fluorescence intensities are proportional [27]. Time 
series correspond to gene expression levels of the sample taken at 
different time points T,- (/ = l,...,N). Here the time series contain 
= 3 1 time-points and cover the 24 hours of the embryonic 
development, with varying sampling frequency (every 30 minutes 
up to time point 14 and then every hour). The time-dependent 
gene expression profile X^(t) is thus defined as: 



Ml 



(1) 



The droID database [28] lists the interactions between genes 
and gene products. For the 20 genes involved in Drosophila muscle 
development, 36 experimentally proven interactions are listed. 
These include 34 interactions between transcription factors and 
genes, and 3 genetic interactions, defined as interactions whose 
molecular mechanism is unknown or results from a cascade of 
interactions [29]. These interactions are listed in Table SI of File 
SI. This table also contains four new interactions that were 
unknown when this work was started. They are used for validation 
purposes. Note that we overlooked protein-protein interactions 
because most of them are not directly obtained from experiment; 
rather, they are predicted from results on other species, and are 
thus less reliable. 

Clustering of gene expression profiles 

The expression profiles A'^(t) that have a similar shape are 
undistinguishable for modeling purposes, and we therefore cluster 
them into groups. However, these profiles present a high noise 
level and missing values. To aUeviate this drawback, we first 
preprocess the data. Two methods were tested. The first consists of 
data filtering with a mobile mean procedure: 
X,,(t,)^X^(t,)/2 + 1',,(t,_i)/4 + 1';,(t, + i)/4; when some values 
are missing in this equation, they are replaced by the neighboring 
values. The second procedure consists of smoothing, using the 
cubic splines algorithm csaps of Madab (The MathWorks Inc., 
Natick, MA), with parameter value p = 0.999 so that the 
interpolated curve follows very closely the experimental points 
[9] . To cluster these preprocessed expression profiles, we need to 
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Figure 1 . Drosophila muscle gene expression profiles belonging 
to a cluster. The 6 members of the cluster are listed at the top of the 
figure. The results for all clusters are given in Figs S1-S2 of File SI. (a) 
Filtered and normalized gene expression profiles contained in the 
cluster; the representative profile is depicted in bold, (b) Expression 
profiles superimposed onto the representative profile by translation 
and scaling; the average profile is depicted in bold. 
doi:1 0.1 371 /journal.pone.0090285.g001 



N 



J2[M^k)-ix„yf (3) 



On the basis of this distance measure, the 20 genes were 
classified into groups displaying similar expression profiles, using 
two distinct clustering procedures, the k-means algorithm and a 
tree-like hierarchical clustering algorithm [31]. The latter proceeds 
by considering all profiles in separate classes and grouping them 
two by two, in such a way that the average distance between any 
pair of profiles in each class is minimum. The procedure is stopped 
when a threshold distance or a maximum number of classes is 
reached. Each of the C clusters so obtained, labeled by c 
[c = 1,...,C), is represented by its normalized average profile, Xdi). 
To compute this profile, we first identified the representative 
profile of the cluster, defined as the profile for which the distance 
with respect to all other members of the class is minimum. AH the 
profiles of the cluster were then superimposed on the represen- 
tative, using the translation and scaling factors that minimize the 
distance. The average profile corresponds to the average, at each 
time point, of all translated and scaled profiles in the cluster. This 
average profile is then normalized, by scaling and translation, so as 
to ensure that the new standard deviation of each profile is 1 and 
that the minimum expression level of all clusters is 0. 

Model structures 

We assumed the system to be autonomous, and considered 
coupled diflFerential equations with a transcription term and a 
degradation term: 



x,(t)=®c(r)-A,(x)Xc(t). 



(4) 



where X = {Xi,...,Xc) and t is the real, continuous time. The dot 
means the derivative with respect to t. Since the transcription term 
&c(X) is defined to be positive, it increases the concentration X^ of 
cluster c, basically through the binding of transcription factors, 
which either activates or represses genes in this cluster. The 
positively defined function Ac(X), called degradation factor. 



define a similarity measure and a clustering procedure. Several 
distances between expression profiles can be defined [30]. Since 
the expression levels Xi^(t) to be modeled are relative levels with 
respect to a gene-dependent and time-independent factor eq. (1), 
no difference should be made between X^(z) and aX^{z), where a 
is an arbitrary positive real number. Moreover, we chose not to 
take into account the difference between two profiles with the 
same shape but different average expression levels, as such profiles 
are merely translated with respect to each other. W e thus require a 
symmetric, translation-invariant and scaling-invariant distance 
measure with zero scaling dimension: Va,AelR : Z)(j¥^,Xv + = 
DiXf„X,),D(Xt,,aX,) = D(Xi„X,), and DiXi„X,) = D(X„Xf,). 
The distance satisfying these constraints has the form [30]: 



D(X^,X,)-- 



1 



Wu-)-<x,y\ |jr„(TA-)-<^v>r ' 



in terms of the mean '(X^y and standard deviation ^^,: 



(2) 
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Figure 2. Value of the objective function ;t,„„v as a function of 
the connectivity q, for different model structures. The results 
obtained with the model structure m"Ji are represented by a solid line 
with triangles, with m^*^ by a dotted line with crosses, and with m^j^ by 
a dashed line with dots. The circled points indicate the solutions 
selected for parameter reduction. 
doi:1 0.1 371/journal.pone.0090285.g002 
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Table 2. Characteristics of the full and reduced solutions using the model structure wj^ 



Model 


q 


Solution 


a 






'/. 


NC' 


PC^ 


AC' 




2 


full 


0.29 


0.29 


0.43 


3.02 


20 


5/1 7(29%) 


68/83 (82%) 






reduced 


0.27 


0.27 


0.30 


13.78 


20 


5/1 7(29%) 


68/83 (82%) 




3 


full 


0.28 


0.29 


0.43 


3.01 


30 


7/17(41%) 


60/83 (72%) 






reduced 


0.28 


0.29 


0.43 


0.95 


20 


4/1 7(24%) 


67/83 (81%) 






full 


0.15 


0.15 


0.43 


3.01 


40 


8/1 7(47%) 


51/83 (61%) 






reduced 


0.20 


0.21 


0.39 


2.77 


29 


8/17(47%) 


62/83 (75%) 


Biased m'^'f^ 


3 


full 


0.45 


0.46 


0.73 


1.68 


30 


17/17(100%) 


70/83 (84%) 






reduced 


0.31 


0.33 


1.12 


3.21 


27 


17/17(100%) 


73/83 (88%) 



The last two lines contain the solutions biased towards the experimental network. The networks corresponding to the solutions In bold are depicted in Fig. 4b-c. 'NC : 
number of connections in the estimated network; "PC: fraction of these connections that are among the 1 7 experimentally verified connections {see Table SI in File SI ); 
^ AC: fraction of the non-connections that are not among the 1 7 experimentally verified connections (thus that are among the 1 0 x 1 0-1 7 = 83 experimental non- 
connections). 

doi:1 0.1 371 /journal.pone.0090285.t002 



describes the degradation, destabUization or activity inhibition of 
the gene products belonging to cluster c, or their removal from the 
system. We used the model structure proposed in [13] for the fuU 
Drosophila gene expression time series, as it showed to be flexible 
and to lead to good results: 



1 + -f exp 



0c.(X) = 



A.(X) = 



d=l 











1 +exp 






















K + + exp 


- E KMt) 






. d=\ 










1 +exp 




E KMt) 





(5) 



with , , 1^ , >Q. For defining this structure, it was 
assumed that the transcription term and degradation factor are 
modulated by interactions between genes and/or gene products. 
For the transcription term, these interactions represent the binding 
of activating or repressing transcription factors as well as the whole 
cascade of protein-protein interactions occurring before the 
binding of the transcription factors. For the degradation term, 
these interactions tend to either prolong ( e.g. through stabilizing 
complexes) or shorten ( e.g. through degradation by proteases) their 
period of activity. The parameters and (1^ and l^T) 
symbolize the maximum and minimum degradation rate (tran- 
scription rate) when >K^ (X'^ >^7) ^""^ converse when 
<K^ {X^ <X^), and Kci/ and Lai give the influence 
(stabilizing or destabilizing according to their sign) of gene 
(product) d on gene (product) c. 

Two other model structures were also tested, which are 
particular cases of the first. These are: 



<'^:0,(X) = 



A + + X^ exp 


- E L,dXAt) 

. d=l 


1 +exp 


- E LMt) 
. i/=i 





, A,(X) = y,, (7) 



obtained from (5) by posing Kcj = 0, =0 and = 27^. 

Parameter identification 

The gene expression network was built iteratively by increasing 
the connectivity q which is defined as the average number of 
connections ending at a node (or class). In a first stage, the number 
of connections was considered to be identical for all nodes. The 
procedure starts by considering q= I, and determines, for each 
node, the connection (defined by a series of parameters) that 
minimizes an objective function. It continues by incrementing q 
until it is large enough to get sufficiently small values of the 
objective function. 

A two-step procedure, based on two different objective 
functions, was used for parameter identification so as to manage 
the large amount of parameters and the non-linearity of the 
equations. The first step consists of constructing the network by 
reproducing the derivatives of the gene expression levels rather 
than the gene expression levels themselves. The objective function 
C(J), where J denotes genericaUy all parameters of the model, is 
thus the square root of the square difference of the measured and 
estimated expression level derivatives, summed over all time 
points: 



C(J) = 



I C ^ N 

c=l k=l 



(8) 



w3J:0,(X) = p„ A,(X) = 



K+ + K^. exp 


- E KMt) 


1 +exp 


- E K,dXAt) 

. d=\ 





, (6) 



which is obtained from (5) by posing Led = 0, X^ =0 and 
1+ = 2Pc, and 



This entails considering the expression levels and their 
derivatives as independent variables and reducing the identifica- 
tion to an algebraic problem. Details of this procedure can be 
found in [13]. In the second step, the connections defined in the 
first stage for q=l, 2,... are maintained and the parameters of 
these connections are identified so as to minimize another 
objective function, expressed as a function of the difiFerence 
between measured and estimated profiles rather than their 
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Figure 3. Estimated and experimental expression profiles for a few clusters. The results for all clusters are given in Figs S3-S7 of File S1 . 
Dots: clustered, filtered and smoothed experimental data; (a)-(b): Estimated expression profiles for clusters {actn, wg, tin, dpp} and {fin} using the three 
model structures; dashed line: ; dotted line: ; solid line: m^^; {c)-(f): Estimated expression profiles for cluster {me/2} and {eve, (w/} using the 
model structure m^^^ and q = 1 (c), 9 = 3 (d) and 9 = 4 (e-f); solid line: before parameter reduction; dashed line: after parameter reduction using the*Pv 
procedure; (g)-(h): Estimated expression profiles for clusters [msp-300, sis] and {srp] using the model structure m"^^, q = 3, and the biasing procedure 
towards the experimental network; solid line: before parameter reduction; dashed line: after parameter reduction using the "Pv procedure. 
doi:l 0.1 371 /journal.pone.0090285.g003 



derivatives. Two variants of this objective function are considered, 
(TmaxQ) and (7(J). They are defined as: 



f»jflx(J) = max (T(.(J) and a{J) = 



A 



-£c7,(J)^ (9) 



point zic, and computing the value of the standard deviation a 
obtained with this perturbed parameter, denoted Operf, 

4) the stability of the solution, evaluated by extrapolating the 
estimated profiles up to a time T„„/ and by computing the 
difference between the average value of the estimated gene 
expression levels over the measuring period and the 
extrapolated level: 



where 



TciJ)- 



k = l 



(10) 



(11) 



The estimate of the gene expression profiles, Xc, is obtained by 
integration of the differential equations (4), using one of the model 
structures given by eqs (5-7), and the ode45 routine of Madab. 
The parameters J are identified so as to minimize either (r(J) or 
f^iiiuyO): using Matlab's fmincon optimization algorithm. The 
initial values of the parameters are set to those obtained for the q- 1 
identification, with the newly added parameters set to zero. 

Parameter reduction 

The next step consists of eliminating unnecessary parameters 
among L^rf and which appear in eqs (5-7), while requiring 
that at least one connection per gene class be kept. We proceed by 
dropping one parameter at a time at each step in the iteration, 
according to two criteria: 

1) the parameter of smallest absolute value; this procedure is 
referred to as *P„; 

2) the parameter which, when dropped, leads to the smallest 
increase of (Tmaxl this procedure is called ^P^. 

These criteria turned out to be more effective than those based 
on the Fisher information matrix [13]. After a parameter is 
eliminated the remaining parameters are optimized again using 
the local optimization algorithm fmincon from Matiab. The 
elimination procedure is then reiterated. 

Evaluation of tlie solutions 

Four criteria were used to evaluate the quality of the estimated 
profiles: 

1) the number or remaining parameters; 

2) the standard dc\ iations a and (7„,ax between estimated and 
experimental profiles, defined in eq. (9); 

3) the robustness of the solution with respect to perturbations of 
its parameters; this is estimated by adding to each parameter 
in turn + 1 % of its value, determining which perturbation 
leads to the largest deviation between measured and estimated 

expression levels, lydT^k) — ^ci'^k)\, for any cluster c and time 



The time T<,„d is taken to be 3 times the measured (embryonic) 
time span. 

Results 

Clustering of the gene expression profiles 

The raw data points representing the gene expression levels 
Xftii:) of the 20 genes involved in the embryonic muscle 
development of Drosophila, were first preprocessed to fill in the 
missing points and to decrease the effect of measurement noise, as 
described in Materials and Methods. Two procedures were tested, 
consisting of filtering and/or smoothing. Moreover, given that 
some of the expression profiles have very similar shapes and are 
thus basically indistinguishable, we proceed to cluster them into 
groups. Since the profiles are defined up to a gene-dependent 
factor (see eq. (1)), the distance used to evaluate the similarity is a 
translation and scaling-invariant measure- of scaling dimension 
zero, denoted D and defined in eq. (2). Two different classification 
methods were tested with this distance, i.e. k-means and a tree-like 
clustering algorithm (see Methods section). 

To choose the most appropriate clustering and preprocessing 
method, wc computc'd: 1) the a\XTagc distance D (d(;fincd by cq. 
(2)) between the members of the same class; 2) the average distance 
between members of different classes; 3) the average distance D 
between the representative member of a class (defined in Methods) 
and the other members; and finally 4) the average distance 
between the representative members of different classes. To have 
well-defined classes, the first and third distance measures that 
correspond to intraclass distances must be as low as possible, while 
the second and fourth distance measures must be as high as 
possible. 

The results are given in Table 1 for classifications into 10 classes 
and in Table S2 in File SI for 5-15 classes. Preprocessing the data 

by succc^ssivi'l)' filtering and smoothing decreases all tlu; distances 
in general, and decreases even more the intra- than the interclass 
distances. We thus selected this preprocessing procedure. The 
lowest intraclass distances and the highest interclass distances are 
sometimes obtained with the tree-like clustering procedure and 
sometimes with k-means, depending on the number of classes. 
However, k-means performs more often slightiy better and we thus 
selected it as clustering procedure. The choice of the number of 
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Figure 4. Gene regulation networlu. (a): Experimental connections; the dashed connections correspond to the 3 connections that were unknown 
when this work was performed; (b): Network obtained with the model structure and 9 = 4, after parameter reduction using the 'P, procedure; 
(c): Network obtained with the model structure m^J, q = 3 and the biasing procedure towards the experimental network, after parameter reduction; 
(b-c): the connections in blue are the experimental connections that have been predicted; the blue dashed connections correspond to the new 
experimental connections that have been predicted. 
doi:l 0.1 371/journal.pone.0090285.g004 



classes is somewhat arbitrary as we do not see a gap in the intra- 
and interclass distances when decreasing the number of classes 
(Table S2 in File SI). 

To choose the total number of classes, we were guided by: the 
concern for having basically indistinguishable profiles in the same 
class; different profiles in different classes; and a sufficiently small 
number of classes to ensure that the parameters can be reliably 
identified from the available data. The analysis of Table S2i File 
SI indicates that the average distance between m(;mbc'rs of the 
classes falls below 0.3 when clustering into 10 classes and more. A 
visual inspection of the superimposed expression levels in the 
classes confirmed that these 10 classes are well-defined (see Fig. SI 
in File SI). We thus fixed the number of classes to 10. This 
classification grouped together the g(;iies {mhc, mlcl, prm, up, 
myoSlDF, myo61F}, {msp-300, sis}, {actn, wg, tin, dpp) and {eve, twi); 
all other classes contain a single gene. 

The representative and other members of these 10 classes are 
shown in Fig. 1 and in Fig. SI of File SI. Each of these clusters 
labeled by c (c= 1,...,10) is represented by its normalized average 
profile, Xc{t), which is defined in the Methods section and is 
depicted in Figs. 1, and SI and S2 in File SI. 

Dynamical modeling of gene expression profiles 

The time-evolution of the 10 normalized average expression 
levels, Xc{x), of the 20 genes involved in muscle development was 
modeled using an autonomous model structure (eq. (4)) with three 
versions of the transcription term and degradation factor given by 

eqs (5-7). 

Because of the large number of parameters and the nonlinearity 
of the equations, it is impossible to have a reliable direct 
identification of all the parameters that define all the possible 
connections between genes. So, assuming that the real gene 
expression network is sparse, we first determine the necessary 
connections, assuming a constant connectivity q for all nodes (see 
section 2.4 for details). We start from g = 1 and increase it until the 
value of the objective function is sufficientiy small. Moreover, we 
use successively two objective functions, denoted ^ and cr, defined 
in eqs (8-9). The first is given as a function of the difference 
between the derivatives of the experimental and estimated profiles, 
and the second as a function of the difference between the 
experimental and estimated profiles. The first is used to define the 
important connections and the second to optimize the parameters. 

Moreover, two variants of the latter objective function are used: 
a which is the standard deviation between estimated and 
experimental profiles averaged over all classes, and CJ,„„^ which 
is the largest standard deviation of all classes (et]. (9)). Using the 
former has the drawback that some classes may be modeled very 
well and others very poorly. Using the latter ensures that all classes 
have (7c values lower than (7„,ax, and gives thus slightly more 
homogenous and satisfactory results. Hence we keep ffma-t as the 
objective function. 

The evolution of dmax during the network construction, from 
connectivity q=\ to ^ = 5, is shown in Fig. 2. Clearly, the model 
structure that has the largest number of parameters, wi^-y, in 
which neither the transcription term nor the degradation factor is 
constant, is superior to the other two structures. For this structure. 



the minimal connectivity required (for which the Cmax value is 
reasonably low) is equal to 2, but a connectivity of 4 yields much 
lower values of (J„,ax- Figs. 3a-b and Fig. S3 in File SI illustrate the 
superiority of in^^^: only this structure allows the correct 
reproduction of the data. We restrict ourselves to this model 
structure and connectivity in the following. 

To get rid of the unnecessary' connections, we proceed to 
parameter elimination. Two methods were tested: ^t,. and ^P^, 
whc're the eliminated paramet('rs are those of smallest absolute 
value or those that lead to the smallest increase of the objective 
function Grmx (see Methods). Note that when several parameter 
eliminations lead to the same value of C!„uix, the one that increases 
a the least was chosen. The elimination procedure was performed 
until a threshold valued of (J„,ax was reached. This threshold was 
set at 0.3, by visual inspection, so as to ensure a fair reproduction 
of the experimental profiles. Moreover, in order to mimic 
biological reaUty, we selected solutions that were robust against 
perturbations of the parameters; in particular we required the 
standard deviation ap„i<0.^ (see Methods). We did not put a 
threshold on the stabihty of the solution, estimated by x (eq. (10)). 

Note that these two characteristics, robustness and stability, are 
quite important for modeling biological systems. Indeed, all such 
systems have a stochastic behavior that depends, among others, on 
changes in the environment, the amount of biomolecules, their 
possible binding and function. However, these changes do not 
affect the main properties of the .system, which continues to give 
similar responses to similar stimuli. Only very large or very specific 
perturbations can bring the system out of its correctly functioning 
state and lead it to another state, which can be functional or 
dysfunctional, depending on the perturbation. It is thus very 
important that the models that simulate biological systems have 
the same properties, and thus do not yield very different solutions 
for similar parameter values. The other characteristic of biological 
systems is its stabihty. Even though the available data usually cover 
only a part of the system's life, it is reasonable to assume that the 
expression levels continue to be of the same order of magnitude, 
never becoming unrealistically large or negative. The same 
property is expected to be built in the model: the solutions must 
take realistic values until the next perturbation or developmental 
stage, or the end of the organism's life. 

The results of the elimination procedures and for q = 2-A 
are given in Table 2 and Table S3 in File SI. These two 
procedures gave comparable results for the data reproduction, and 
gives usually better results for the stability and robustness, 
especially for q = A. We will thus in the following only detail the 
results obtained with "Py. 

The estimated profiles are shown in Figs 3c-f and Figs S4-S6 in 
File S 1 . The number of connections of the reduced solutions vary 
between 20 (for q = 2-3) and 29 (for {q = 4) and, accordingly, the 
reproduction of the experimental profiles is somewhat better for 
q = 4: {a and dmax - 0.2) than for q = 2-3 (ff and a„ax = 0.3). Note 
that 20 seems to be the minimum number of connections needed: 
the reduction procedure applied on the q — 2 solution, with 20 
initial connections, fails to eliminate further connections. All the 
solutions show a fairly good robustness with respect to the 
variations of the parameters, with values of (7^cr( between 0.3 and 
0.4. In contrast, the stabifity upon extrapolation in time differs 
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among the solutions. The best reduced solution is that obtained 
from g' = 3 {x = 0.9), followed by the q-4 solution (z = 2.8); the 
q = 2 solution is not stable at all (x = 13.8). 

Comparison with the experimental network 

The connections between gene clusters obtained by our 
dynamical modeling procedure can be compared to the experi- 
mentally determined connections, described in section 2.1 and 
Table SI in File SI. For this comparison, we first have to 
transform the experimental network between individual genes into 
a network between gene clusters. This is done by considering two 
clusters as being connected if at least one member in each cluster is 
connected to at least one member in another cluster. The 
experimental cluster network so defined contains 1 7 connections; 
it is shown in Fig. 4a. Note that these interactions are oriented but 
unsigned, as the sign is not experimentally determined. 

The intersection between the experimental and modeled 
networks is indicated in Table 2. The smallest intersection is 
obtained with the reduced q = 3 solutions (only 4 out of 1 7 
reproduced connections), whereas the largest intersection is 
obtained with the full and reduced q = 4: solutions (8 out of 17, 
which amounts to 47%). The number of experimentally non- 
observed connections that are also absent in the modeled solutions 
is of course much larger (between 61% and 82%). 

The number of common connections between the experimental 
network and the difierent solutions can be compared to the 
number of common connections that are expected at random. The 
most significant result is found for the (/ = 4 r<;duced solution: the 
probability of finding 8 common connections among two sets of 29 
and 1 7 connections each, out of a total set of 1 00 connections, is 
equal to 0.048. The same value is found for the absent 
connections. 

The network corresponding to this best solution {q = 4: reduced 
solution) is depicted in Fig. 4b, with the correctly reproduced 
connections highlighted. Among the 8 connections that are in 
agreement with experiment, four involve the gene mef2 (myocyte 
enhancing factor 2). In particular, the subnetwork involving the 
genes and gene clusters mef2, srp, Imd, how, {eve, twi} and {actn, wg, 
tin, dpp) is in good agreement with experiment. 

Note moreover that all existing functional connections have not 
yet been determined experimentally. Therefore some of the 
predicted connections may in fact be real ones. This is indeed the 
case: among the four new connections that were unknown when 
this work was performed (indicated in Table SI in File SI), which 
correspond to three new connections between clusters (see Fig4a), 
two were actually predicted, as shown in Fig. 4b. These correcdy 
predicted connections involve mef2, which supports the conclusion 
that this region of the network is well reproduced by our model. 
Adding these new connections increases the number of correcdy 
predicted connections to 10, out of a total of 20 experimental 
connections. 

Our gene regulation network connects gene clusters rather than 
individual genes and is thus of a different type than the networks 
obtained with other methods. Nevertheless, the fraction of 
correctly predicted connections, either between genes or gene 
clusters, can be compared among the different methods applied to 
the same ensemble of Drososphila muscle development genes 
[19,20]. These methods do not reach our ,^0% score. 

Biasing towards the experimental network 

To analyze if it is possible to find solutions that reproduce the 
data well but are different from those obtained in the previous 
section and are more consistent with the experimental data, we 
perform a biased modeling procedure. This procedure follows the 



same two steps: first the construction of the network from q — I to 
higher q by minimizing the cost function 4 (eq. (8)), and then the 
minimization of the cost function (T^ax (eq- (9)) while keeping the 
same network. However, here, instead of allowing a free choice 
among aU possible connections, the choice was biased towards the 
experimentally proven connections: if, for a gi\en cluster, 
experimental connections do exist and have not yet been included 
in the network in a previous step, the choice is limited to those; 
otherwise the choice is free. Moreover, in the parameter reduction 
procedure, parameters involved in the experimental connections 
may be eliminated but the connection may not be dropped 
entirely. 

The results obtained with this procedure are given in the last 
two lines of Table 2 and in Figs 3g-h and Fig S7 in File SI. The 
reproduction of the expression profiles is somewhat less accurate 
than with the unbiased method but remains good, with a = 0.4 and 
0.3 for the full and reduced solutions starting at ^ = 3. Note that 
the optimization of the solutions performs less well with some 
imposed connections, as u is higher for the fuU than for the 
reduced solution. The robustness of the reduced solution is also 
somewhat less good than for the unbiased procedure {(Tpert =1-1), 
whereas the stability is similar. Note that the total number of 
connections in the reduced solution is equal to 27. and that we had 
to add 10 connections in addition to the 17 experimental ones to 
reach a reasonable accuracy in the profile reproduction. However, 
the number of 27 connections is comparable to the number of 
connections obtained with the unbiased procedure (20-29). Note 
that among the three new experimental connections that were not 
imposed in this procedure, one appears to be correcdy predicted. 

Conclusion 

One successful result of our work is the consistent construction 
of dynamical models, on the sole basis of the gene expression 
profiles of the genes involved in Drosophila muscle development 
obtained from DNA microarray series. The models obtained 
reproduce the expression profiles r|uitc' ^\(;11, are robust against 
parameter variation and do not take unrealistically large values 
when extrapolated in time. However, our results present two 
important drawbacks. First, the solutions are not unique, and 
different networks are obtained with very similar data reproduc- 
tion abilities. The additional requirement of robust and stable 
solutions filters out some of them, but the number of acceptable 
solutions remains high. Second, half of the experimental connec- 
tions are obtained by the best of our unbiased models. To obtain 
all experimental connections, we had to bias the model 
construction towards the experimental network. 

The amount of 50% of the experimental connections found by 
our models compares quite favorably with the results of other 
analyses. However, our solution is still far from perfect and it is 
worthwhile to question the basics of our approach. We made a 
number of assumptions that, although commonly made in 
biological modeling, could explain the limited overlap between 
estimated and experimental connections. These assumptions are 
detailed hereunder. 

• We considered only the 20 genes known to be involved in 

muscle development of Drosophila. In reality, these genes are 
connected to other genes. We suppose here that these additional 
connections are not (or much less) important for the transcrip- 
tional regulation of these 20 genes. More generally, we disregarded 
the effect of aU external factors on the regulation of these 20 genes, 
which is quite a bold (but common) assumption. 

• As some of these 20 genes have similar experimental 
expression profiles, which are moreover quite noisy, we prepro- 
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cessed the data by filtering and smoothing them, and grouped the 
similar profiles together using a k-means classification procedure 
and a translation-invariant and scaling-invariant distance measure. 
Although we tested several preprocessing and classification 
procedures and although the selected ones appear to perform 
quite well, we cannot be sure that the noise is eliminated in the 
right way, and that the clusters are formed adequately. 

• As some genes were clustered together, the transcriptional 
model we derive connects gene clusters instead of individual genes. 
The interpretation is that when a cluster is found to regulate 
another cluster, some of their members do so, but not necessarily 
all. It is indeed possible that the different members of a cluster - 
even though their expression profiles are similar - are not 
regulated by the same gene (cluster). The different - almost 
equivalent - solutions in terms of data reproduction and 
robustness that we found could well reflect the differences in 
connections between individual members of the clusters. 

• Another possibility is that the DNA microarray data, and thus 
the networks predicted from them, correspond to external 
conditions that differ from those of the experimental inter-gene 
connections. It has for example been shown that networks may be 
different when responding to different types of stress [32]. 

• The model structure we selected is quite flexible and gives 
good results in terms of data reproduction; it could however be 
argued that it does not mimic the biological mechanism and that 
another model structure should be used. 

• The experimentally determined interactions are listed as 
regulatory interactions. However, some of them could be involved 
only indirectiy in regulation, and others could be side actors. 
Moreover, not all interactions are known today, and some of the 
predicted interactions - or of the non-predicted ones - wiU 
perhaps be experimentally demonstrated in the future. 

It is difficult at this point to identify the reason for the limited - 
though substantial - overlap between experimental and estimated 
connections and of the large number of almost equivalent 
solutions. Note that the latter result could be taken as a positive 
result that mimics reality. Indeed, gene regulation networks have 
been shown experimentally to display some elasticity [33]. 
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